import pandas as pd
from datetime import datetime

def parse(x):
    return datetime.strptime(x, "%m/%d/%Y")

df = pd.read_csv(
    "https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/amazon_revenue_profit.csv",
    parse_dates=['Quarter'],
    date_parser=parse,
)
df.head()
| | Quarter | Revenue | Net Income |
|---|---|---|---|
| 0 | 2020-03-31 | 75452 | 2535 |
| 1 | 2019-12-31 | 87437 | 3268 |
| 2 | 2019-09-30 | 69981 | 2134 |
| 3 | 2019-06-30 | 63404 | 2625 |
| 4 | 2019-03-31 | 59700 | 3561 |
amazon_df=df.set_index("Quarter")
amazon_df.head()
| Quarter | Revenue | Net Income |
|---|---|---|
| 2020-03-31 | 75452 | 2535 |
| 2019-12-31 | 87437 | 3268 |
| 2019-09-30 | 69981 | 2134 |
| 2019-06-30 | 63404 | 2625 |
| 2019-03-31 | 59700 | 3561 |
import plotly.express as px
fig=px.line(df,x="Quarter",y="Revenue",title="Amazon Revenue Slider")
fig.show()
# Same plot, now with a range slider and 1Y/2Y/3Y range-selector buttons
fig=px.line(df,x="Quarter",y="Revenue",title="Amazon Revenue Slider")
fig.update_xaxes(
rangeslider_visible=True,
rangeselector=dict(
buttons=list([
dict(count=1,label="1Year",step="year",stepmode="backward"),
dict(count=2,label="2Year",step="year",stepmode="backward"),
dict(count=3,label="3Year",step="year",stepmode="backward"),
dict(step="all")
])
)
)
fig.show()
Now look at this graph: the seasonal peaks stay roughly constant until 2009; the 2010 peak is slightly higher, then the level is constant again; the 2014 peak is higher still, and the peaks keep increasing over time. From 2006 to 2010 the time series is mostly stationary, but after that it is not.
Now let's test whether the data is stationary. For the KPSS test the hypotheses are:
Null hypothesis: the data is stationary
Alternate hypothesis: the data is not stationary
#kpss test
from statsmodels.tsa.stattools import kpss
What does the KPSS test tell us? It helps us determine whether a time series is stationary around a mean or around a linear trend.
tstest = kpss(amazon_df["Revenue"], regression="ct")  # "ct": stationarity around a trend
tstest
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\tsa\stattools.py:2018: InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is smaller than the p-value returned.
(0.30665545975169556,
0.01,
4,
{'10%': 0.119, '5%': 0.146, '2.5%': 0.176, '1%': 0.216})
Test statistic = 0.3066. Since 0.3066 > 0.216 (the 1% critical value), the null hypothesis is rejected. (The reported p-value of 0.01 is just the smallest value in the look-up table, hence the warning above: the true p-value is even smaller.)
So, our data is not stationary.
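The comparison above can be sketched programmatically. A minimal sketch using the critical values printed by the test; `interpret_kpss` is a hypothetical helper written here for illustration, not part of statsmodels:

```python
# Hypothetical helper: decide stationarity from a KPSS statistic and its
# critical values. KPSS rejects the null (stationarity) when the statistic
# EXCEEDS the critical value at the chosen level.
def interpret_kpss(statistic, crit_values, level="5%"):
    return "not stationary" if statistic > crit_values[level] else "stationary"

crit = {"10%": 0.119, "5%": 0.146, "2.5%": 0.176, "1%": 0.216}
print(interpret_kpss(0.3066, crit, "1%"))  # 0.3066 > 0.216 -> not stationary
print(interpret_kpss(0.05, crit, "5%"))    # a small statistic -> stationary
```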
Now we use the statsmodels package to do a seasonal decomposition (because we have seasonality). The decomposition helps us decide which model we want, additive or multiplicative: when the seasonal swings stay roughly constant in size, an additive model fits; when they grow with the level of the series, as they do here, a multiplicative model fits.
import statsmodels.api as sm
res=sm.tsa.seasonal_decompose(amazon_df['Revenue'],model="multiplicative")
resplot=res.plot()
Second panel: there is an increasing trend. Third panel: we know our data has seasonality, so there is a seasonal component. Finally, the residual is the error/noise term: what remains of the observed values after the trend and seasonal components are accounted for.
Now print the observed values (this is the actual data):
res.observed
Quarter
2020-03-31 75452.0
2019-12-31 87437.0
2019-09-30 69981.0
2019-06-30 63404.0
2019-03-31 59700.0
...
2006-03-31 2279.0
2005-12-31 2977.0
2005-09-30 1858.0
2005-06-30 1753.0
2005-03-31 1902.0
Name: Revenue, Length: 61, dtype: float64
print(res.trend)  # trend component; NaN at the edges because the centered moving average needs data on both sides
Quarter
2020-03-31 NaN
2019-12-31 NaN
2019-09-30 72099.500
2019-06-30 68248.750
2019-03-31 64691.375
...
2006-03-31 2369.375
2005-12-31 2265.000
2005-09-30 2169.625
2005-06-30 NaN
2005-03-31 NaN
Name: trend, Length: 61, dtype: float64
print(res.seasonal)
Quarter
2020-03-31 0.941840
2019-12-31 1.289518
2019-09-30 0.894993
2019-06-30 0.873649
2019-03-31 0.941840
...
2006-03-31 0.941840
2005-12-31 1.289518
2005-09-30 0.894993
2005-06-30 0.873649
2005-03-31 0.941840
Name: seasonal, Length: 61, dtype: float64
res.resid
Quarter
2020-03-31 NaN
2019-12-31 NaN
2019-09-30 1.084496
2019-06-30 1.063372
2019-03-31 0.979831
...
2006-03-31 1.021253
2005-12-31 1.019256
2005-09-30 0.956844
2005-06-30 NaN
2005-03-31 NaN
Name: resid, Length: 61, dtype: float64
Finally, we have all three components: trend, seasonal, and residual. For a multiplicative model, multiplying the three together recovers the original value.
res.observed.iloc[2]  # third observed value (position 2)
69981.0
res.trend.iloc[2] * res.seasonal.iloc[2] * res.resid.iloc[2]
69980.99999999999
Now, why do we need the decomposition? AR and MA models perform better when the data is stationary, i.e. when it has been detrended.
For a multiplicative model, detrended value = observed value / trend value (for an additive model it is observed value - trend value).
Plot the detrended values:
pd.DataFrame(res.observed/res.trend).plot()
<AxesSubplot:xlabel='Quarter'>
The output shows that there is no trend left: the seasonality is still captured, but the data has become completely detrended (the previous picture showed an increasing trend). Now we can use any model on it.